-
Notifications
You must be signed in to change notification settings - Fork 5
qprofiler1 / qvisualise - handle new fastq header #336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ore robust to deal with these 2 issues here, qprofiler was not aware of this new fastq header format and so each header became a unique instrument (all 28 million of them). Qvisualise would then run out of memory trying to deal with them all. HAvave added code to deprperly deal with the new format, in qprofiler, and perhaps more importantly, addeded some limits to the size of collections that can be handled by qvisualise.
delocalizer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. A couple of minor changes suggested.
| String [] firstElementParams = params[0].split(" "); | ||
| if (firstElementParams.length != 2) { | ||
| throw new UnsupportedOperationException("Incorrect header format encountered in parseFiveElementHeader. Expected '@ERR091788.3104 HSQ955_155:2:1101:13051:2071/2' but recieved: " + Arrays.deepToString(params)); | ||
| throw new UnsupportedOperationException("Incorrect header format encountered in parseFiveElementHeader. Expected '@ERR091788.3104 HSQ955_155:2:1101:13051:2071/2' but received: " + Arrays.deepToString(params)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and ll 299, 306 the Exception messages are a bit confusing because they refer not to a generic format but to a specific example as 'Expected'; is there a better wording for these?
| logger.info("supplied Xmx memory: " + xmxSize); | ||
| double ratio = (double)xmxSize / fileSize; | ||
| logger.info("memory / file size ratio: " + ratio); | ||
| if (ratio < 8) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add some comment about this heuristic?
| map.put(isInteger ? (T) Integer.valueOf(tallyItemElement.getAttribute("value")) | ||
| : (T) tallyItemElement.getAttribute("value") , new AtomicLong(count)); | ||
|
|
||
| if (tallyItemsNL.getLength() < 100000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
magic number -> constant
| } | ||
| } | ||
| } else { | ||
| log.warn(cycleElement.getTagName() + " has too many elements: " + tallyItemsNL.getLength() + " - will leave out of report"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and refer to the exceeded constant value here in the log message
…as per reviewer suggestions
fix(qvisualise/qprofiler): deal with new fastq header and make qvis more robust to deal with these
2 issues here,
qprofilerwas not aware of this newfastqheader format and so each header became a unique instrument (all 28 million of them).Qvisualisewould then run out of memory trying to dealwith them all. Have added code to properly deal with the new format, in
qprofiler, andperhaps more importantly, added some limits to the size of collections that can be handled by
qvisualise.New fastq header:
@SRR14585604.8 A00805:41:HMJJWDRXX:1:1101:16929:1000 length=101which looks like the NCBI header as described here
What
qprofilerwill now do for this is to count the number of spaces in the header. If there are 2, it will just send the middle part to be analysed. For any other number of spaces, the header will be treated as it was previously.qvisualisewill now run some checks on the input xml file size along with the supplied Xmx parameter, and log if it thinks a memory issue may be encountered.It will also ignore any xml element that has more than 100000 items in it.
Type of change
How Has This Been Tested?
Additional unit test have been added.
Code has been run against offending fastq file and has produced satisfactory results
Are WDL Updates Required?
Nope
Checklist: